In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import plotly.express as px

# Import data
df = pd.read_csv("gapminder_clean.csv")

CO2 emissions (metric tons per capita) and GDP per cap. Year 1962

Go to the table of contents

In [2]:
#Remove NaN values for "CO2 emissions (metric tons per capita)" stats
df_CO2_clean = df.copy()
df_CO2_clean.dropna(subset = ["CO2 emissions (metric tons per capita)"], inplace=True)
df_CO2_clean.dropna(subset = ["gdpPercap"], inplace=True)


filtered_data = df_CO2_clean[(df_CO2_clean["Year"] == 1962) & (df_CO2_clean["Country Name"] != "Kuwait")]
# Scatter plot
fig = px.scatter(filtered_data, x='CO2 emissions (metric tons per capita)', 
                 y="gdpPercap", color='Country Name', size='pop', hover_data=['Country Name'], 
                 title="CO2 emissions (metric tons per capita) and GDP per cap. Year 1962")
fig.show()
In [3]:
# Pearson's r
from scipy.stats import pearsonr

corr, p_value = pearsonr(filtered_data["CO2 emissions (metric tons per capita)"], filtered_data["gdpPercap"])
print("\n Pearson correlation of CO2 emissions (metric tons per capita) and gdpPercap, year 1962: \n",
      "Correlation value: ",corr,"p-value: ",p_value)
 Pearson correlation of CO2 emissions (metric tons per capita) and gdpPercap, year 1962: 
 Correlation value:  0.8063294717615218 p-value:  1.0822253072448907e-25

In what year is the correlation between CO2 emissions (metric tons per capita) and gdpPercap the strongest?

Go to the table of contents

In [4]:
unfiltered_data = df_CO2_clean[(df_CO2_clean["Year"] != 1962) & (df_CO2_clean["Country Name"] != "Kuwait")]
unfiltered_data.groupby(by=["Year"]).corrwith(other=df_CO2_clean["CO2 emissions (metric tons per capita)"]).sort_values("gdpPercap", ascending=False)["gdpPercap"].head(1)
Out[4]:
Year
1972    0.859351
Name: gdpPercap, dtype: float64
In [5]:
new_filtered_data = df_CO2_clean[(df_CO2_clean["Year"] == 1972) & (df_CO2_clean["Country Name"] != "Kuwait")]

fig = px.scatter(new_filtered_data, x='CO2 emissions (metric tons per capita)', y="gdpPercap", color="continent",
                 size='pop', hover_data=['Country Name'], 
                 title="CO2 emissions (metric tons per capita) and GDP per cap. Year 1972")
fig.show()

What is the relationship between continent and 'Energy use (kg of oil equivalent per capita)'?

Go to the table of contents

In [6]:
df.groupby("continent")["Energy use (kg of oil equivalent per capita)"].describe()
Out[6]:
count mean std min 25% 50% 75% max
continent
Africa 199.0 698.516783 627.356473 9.715410 375.184208 449.521247 745.393302 3071.774832
Americas 188.0 1703.620453 2377.181918 219.075497 556.033108 749.029108 1384.585146 14608.009868
Asia 185.0 1867.280336 2590.043514 86.903767 345.370792 760.140852 1987.087308 12122.050603
Europe 256.0 3146.062066 1733.880414 350.101258 2073.999447 3027.931793 4034.557831 14746.031339
Oceania 20.0 3980.314420 1123.410756 1791.461322 3143.501420 4044.850674 4783.650230 5868.347097
In [7]:
fig = px.box(df, x="Energy use (kg of oil equivalent per capita)", y="continent", hover_data=['Country Name'])
fig.show()

I need to know if the data satisfy parametric requirements to use parametric tests.

  1. The population from which samples are drawn should be normally distributed. -> Shapiro-Wilk test.
  2. Independence of cases: the sample cases should be independent of each other. -> I assume that this condition is satisfied.
  3. Homogeneity of variance: Homogeneity means that the variance among the groups should be approximately equal. -> Levene test

First, I separate the "Energy use" data per continent, (I'll have five series, one per each continent), and I remove missing values to avoid errors on tests.

In [8]:
americas_energy = df[df["continent"] == "Americas"]["Energy use (kg of oil equivalent per capita)"].dropna()
oceania_energy = df[df["continent"] == "Oceania"]["Energy use (kg of oil equivalent per capita)"].dropna()
africa_energy = df[df["continent"] == "Africa"]["Energy use (kg of oil equivalent per capita)"].dropna()
europe_energy = df[df["continent"] == "Europe"]["Energy use (kg of oil equivalent per capita)"].dropna()
asia_energy = df[df["continent"] == "Asia"]["Energy use (kg of oil equivalent per capita)"].dropna()
Shapiro-Wilk test

The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution (first requeriment to use parametric tests).

In [9]:
import scipy.stats as stats
print("\n",
"Americas: ",stats.shapiro(americas_energy), " Reject null hypotesis","\n",
"Oceania: ",stats.shapiro(oceania_energy),"Can not reject null hypotesis","\n",
"Africa: ",stats.shapiro(africa_energy)," Reject null hypotesis","\n",
"Europe: ",stats.shapiro(europe_energy)," Reject null hypotesis","\n",
"Asia: ",stats.shapiro(asia_energy)," Reject null hypotesis")
 Americas:  ShapiroResult(statistic=0.5632225871086121, pvalue=1.5868403861741054e-21)  Reject null hypotesis 
 Oceania:  ShapiroResult(statistic=0.9818098545074463, pvalue=0.9552662372589111) Can not reject null hypotesis 
 Africa:  ShapiroResult(statistic=0.6747236251831055, pvalue=2.334312218924711e-19)  Reject null hypotesis 
 Europe:  ShapiroResult(statistic=0.8964672088623047, pvalue=2.9994249271803053e-12)  Reject null hypotesis 
 Asia:  ShapiroResult(statistic=0.6609910130500793, pvalue=4.943039538112358e-19)  Reject null hypotesis

Shapiro-Wilk tests: rejected 4/5 null hypotheses. Data is not normal distributed.

Levene test

The Levene test tests the null hypothesis that all input samples are from populations with equal variances (third requeriment to use parametric tests).

In [10]:
import scipy.stats as stats
stats.levene(*[americas_energy, oceania_energy, africa_energy,europe_energy, asia_energy], 
             center='median', proportiontocut=0.05)
Out[10]:
LeveneResult(statistic=12.424040765211972, pvalue=8.003283101567902e-10)

Leneve test: rejected null hypothesis.

Shapiro-Wilk and Leneve tests were rejected. Data doesn't satisfy parametric requeriments. So I need to use non-parametric tests.

Kruskal-Wallis H-test

The Kruskal-Wallis H-test tests the null hypothesis that the population median of all of the groups are equal.

In [11]:
stats.kruskal(*[americas_energy, oceania_energy, africa_energy,europe_energy, asia_energy])
Out[11]:
KruskalResult(statistic=318.67631745519566, pvalue=1.012403877816621e-67)

Kruskal-Wallis H-test: rejected null hypothesis.

So I have to compare the means to detect the differences and similarities between continents.

Dunn’s test

Post hoc pairwise test for multiple comparisons of mean rank sums. This test is run after Kruskal-Wallis's one-way analysis of variance by ranks to do pairwise comparisons.

In [12]:
import scikit_posthocs as sp

dunn_test = sp.posthoc_dunn([americas_energy, oceania_energy, africa_energy,europe_energy, asia_energy])
dunn_test.columns =["America","Oceania","Africa","Europe","Asia"]
dunn_test.index =["America","Oceania","Africa","Europe","Asia"]
dunn_test
Out[12]:
America Oceania Africa Europe Asia
America 1.000000e+00 4.861527e-08 6.617961e-10 5.920341e-22 2.009379e-01
Oceania 4.861527e-08 1.000000e+00 3.693759e-16 1.227937e-01 1.802074e-09
Africa 6.617961e-10 3.693759e-16 1.000000e+00 1.099123e-60 1.218580e-06
Europe 5.920341e-22 1.227937e-01 1.099123e-60 1.000000e+00 6.037190e-28
Asia 2.009379e-01 1.802074e-09 1.218580e-06 6.037190e-28 1.000000e+00
In [13]:
fig = px.imshow(dunn_test, title="Dunn's test between continents' energy use means")
fig.show()

Asia's and Americas' energy use means are similar.

Oceania's and Europe's energy use means are similar.

Is there a significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' in the years after 1990?

Go to the table of contents

In [14]:
europe_and_asia_after_1990 = df[((df["continent"] == "Europe") |
                (df["continent"] == "Asia")) & (df["Year"] > 1990) & 
                (df['Imports of goods and services (% of GDP)'] < 97 ) ] # Deleted Outliers (Singapore is an exception)
europe_and_asia_after_1990.groupby("continent")["Imports of goods and services (% of GDP)"].describe()
Out[14]:
count mean std min 25% 50% 75% max
continent
Asia 94.0 41.846948 23.363432 0.079506 25.47907 39.161401 58.32903 96.742045
Europe 114.0 41.789241 16.721352 17.345130 28.81216 37.787354 51.14954 88.512248
In [15]:
fig = px.box(europe_and_asia_after_1990, x="Imports of goods and services (% of GDP)", y="continent", hover_data=['Country Name', "Year"])
fig.show()
In [16]:
europe_imports = europe_and_asia_after_1990[europe_and_asia_after_1990["continent"] == "Europe"]["Imports of goods and services (% of GDP)"].dropna()
asia_imports = europe_and_asia_after_1990[europe_and_asia_after_1990["continent"] == "Asia"]["Imports of goods and services (% of GDP)"].dropna()
print("\n", stats.shapiro(europe_imports), "\n", stats.shapiro(asia_imports))
stats.levene(*[europe_imports,asia_imports], center='median', proportiontocut=0.05)
 ShapiroResult(statistic=0.9290487170219421, pvalue=1.3633670278068166e-05) 
 ShapiroResult(statistic=0.9722954034805298, pvalue=0.04313851520419121)
Out[16]:
LeveneResult(statistic=10.40513543276282, pvalue=0.001461628537398237)

  • Shapiro-Wilk tests: rejected 1/2 null hypotheses. Data is not normal distributed.
  • Leneve test: rejected null hypothesis.

Like the anterior case, parametric requirements are not satisfied. I need to compare two means using a non-parametric test.

Mann-Whitney U test

The Mann-Whitney U test is used to compare differences between two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed.

In [17]:
stats.mannwhitneyu(x=europe_imports, y=asia_imports)
Out[17]:
MannwhitneyuResult(statistic=5251.0, pvalue=0.40264017492355547)

Can not reject the null hypothesis of identical average scores.

What is the country (or countries) that has the highest 'Population density (people per sq. km of land area)' across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

Go to the table of contents

I'll group the records by Country Name, then I'll calculate the pop density mean per country

In [18]:
df.groupby("Country Name")["Population density (people per sq. km of land area)"].mean().sort_values(ascending=False).head()
Out[18]:
Country Name
Macao SAR, China        14732.035211
Monaco                  14089.900000
Hong Kong SAR, China     5153.056903
Singapore                4361.499928
Gibraltar                2622.250000
Name: Population density (people per sq. km of land area), dtype: float64
In [19]:
fig = px.line(df, x="Year", y="Population density (people per sq. km of land area)", color="Country Name",
              line_group="Country Name", hover_name="Country Name", 
              title="Population density (people per sq. km of land area) across all years")
fig.show()

What country (or countries) has shown the greatest increase in 'Life expectancy at birth, total (years)' since 1962?

Go to the table of contents

I'll extract the first and the last record that contains the "Life expectancy at birth" value for each country (not the minimum and maximum values), I'll subtract the last minus the first value and then I'll calculate the relative increment, in percentage:

relative increment (%) = (last record - first record)/first record * 100

In [20]:
life_expectancy_by_country = df.groupby('Country Name')['Life expectancy at birth, total (years)'].agg(['last','first'])
life_expectancy_by_country['diff'] = life_expectancy_by_country['last'] - life_expectancy_by_country['first']
life_expectancy_by_country['percentage'] = life_expectancy_by_country['diff'] / life_expectancy_by_country['first'] * 100
life_expectancy_by_country["Country Name"] = life_expectancy_by_country.index
life_expectancy_by_country.sort_values(by="percentage", ascending=False).head()
Out[20]:
last first diff percentage Country Name
Country Name
Bhutan 66.293098 33.094146 33.198951 100.316687 Bhutan
Maldives 75.399707 38.483561 36.916146 95.927054 Maldives
Mali 54.261927 28.548463 25.713463 90.069518 Mali
Timor-Leste 65.824195 34.739049 31.085146 89.481858 Timor-Leste
Nepal 66.551927 35.952293 30.599634 85.111774 Nepal
In [21]:
life_expectancy_by_country_values_greater_than_zero = life_expectancy_by_country[life_expectancy_by_country["percentage"] > 0] #Removed negative values because cause error
fig = px.scatter(life_expectancy_by_country_values_greater_than_zero, x='percentage', y="last", color="percentage", 
                 title="Changes in life expectancy at birth, relative value (%)",
                 size='percentage', hover_data=['Country Name'],
                labels={
                     "last": "Life expectancy at birth, total (years). Last record.",
                     "percentage": "Difference between first and last record, (percentage)"
                 },)
fig.show()

Changes in life expectancy at birth, absolute value (Years)

Also I can compare between absolute values, getting different results:

absolute increment (years) = last record - first record

In [22]:
life_expectancy_by_country.sort_values(by="diff", ascending=False).head()
Out[22]:
last first diff percentage Country Name
Country Name
Maldives 75.399707 38.483561 36.916146 95.927054 Maldives
Bhutan 66.293098 33.094146 33.198951 100.316687 Bhutan
Timor-Leste 65.824195 34.739049 31.085146 89.481858 Timor-Leste
Tunisia 74.202439 43.341683 30.860756 71.203410 Tunisia
Oman 75.123610 44.300512 30.823098 69.577294 Oman
In [23]:
life_expectancy_by_country_values_greater_than_zero = life_expectancy_by_country[life_expectancy_by_country["diff"] > 0] #Removed negative values because cause error
fig = px.scatter(life_expectancy_by_country_values_greater_than_zero, x='diff', y="last", color="diff", 
                 title="Changes in life expectancy at birth, absolute value (Years)",
                 size='diff', hover_data=['Country Name'],
                labels={
                     "last": "Life expectancy at birth, total (years). Last record.",
                     "diff": "Difference between first and last record, (years)"
                 },)
fig.show()